Week 1

Machine Learning definitions:

Arthur Samuel (1959). ML: Field of study that gives computers the ability to learn without being explicitly programmed.

Tom Mitchell (1998). Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

For example: Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam.

  • The task T is: Classifying emails as spam or not spam
  • The experience E is: Watching you label emails as spam or not spam
  • The performance measure P: the number (or fraction) of emails correctly classified as spam/not spam

Example: playing checkers.

  • E = the experience of playing many games of checkers
  • T = the task of playing checkers.
  • P = the probability that the program will win the next game.

In general, any machine learning problem can be assigned to one of two broad classifications:

Supervised learning and Unsupervised learning.

Supervised Learning

In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

Supervised learning problems are categorized into "regression" and "classification" problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

Example 1:

Given data about the size of houses on the real estate market, try to predict their price. Price as a function of size is a continuous output, so this is a regression problem.

We could turn this example into a classification problem by instead making our output about whether the house "sells for more or less than the asking price." Here we are classifying the houses based on price into two discrete categories.

Example 2:

(a) Regression - Given a picture of a person, we have to predict their age on the basis of the given picture

(b) Classification - Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.

Unsupervised Learning

Cocktail party problem

There's a party, room full of people, all sitting around, all talking at the same time and there are all these overlapping voices because everyone is talking at the same time, and it is almost hard to hear the person in front of you.

So maybe at a cocktail party with two people talking at the same time, and it's a somewhat small cocktail party. And we're going to put two microphones in the room, and because these microphones are at two different distances from the speakers, each microphone records a different combination of these two speakers' voices.

Maybe speaker one is a little louder in microphone one and maybe speaker two is a little bit louder in microphone two, because the two microphones are at different positions relative to the two speakers, but each microphone records an overlapping combination of both speakers' voices.

So here's an actual recording of two speakers recorded by a researcher. Let me play for you the first, what the first microphone sounds like.

One (uno), two (dos), three (tres), four (cuatro), five (cinco), six (seis), seven (siete), eight (ocho), nine (nueve), ten (y diez).

All right, maybe not the most interesting cocktail party, there's two people counting from one to ten in two languages but you know. What you just heard was the first microphone recording, here's the second recording.

Uno (one), dos (two), tres (three), cuatro (four), cinco (five), seis (six), siete (seven), ocho (eight), nueve (nine) y diez (ten).

So what we can do is take these two microphone recordings and give them to an Unsupervised Learning algorithm called the cocktail party algorithm, and tell the algorithm: find structure in this data for us. And what the algorithm will do is listen to these audio recordings and say, you know, it sounds like two audio sources are being added together, or summed together, to produce the recordings that we had. Moreover, what the cocktail party algorithm will do is separate out these two audio sources that were being added or summed together to form the recordings and, in fact, here's the first output of the cocktail party algorithm.

One, two, three, four, five, six, seven, eight, nine, ten.

So, it separated out the English voice in one of the recordings. And here's the second output.

Uno, dos, tres, cuatro, cinco, seis, siete, ocho, nueve y diez.

Not too bad, to give you one more example, here's another recording of another similar situation, here's the first microphone:

One, two, three, four, five, six, seven, eight, nine, ten.

OK so the poor guy's gone home from the cocktail party and he's now sitting in a room by himself talking to his radio. Here's the second microphone recording.

One, two, three, four, five, six, seven, eight, nine, ten.

When you give these two microphone recordings to the same algorithm, what it does is, again, say, you know, it sounds like there are two audio sources, and moreover, the algorithm says, here is the first of the audio sources I found.

One, two, three, four, five, six, seven, eight, nine, ten.

So that wasn't perfect, it got the voice, but it also got a little bit of the music in there. Then here's the second output of the algorithm.

Not too bad, in that second output it managed to get rid of the voice entirely. And just, you know, cleaned up the music, got rid of the counting from one to ten.

So you might look at an Unsupervised Learning algorithm like this and ask how complicated it is to implement. It seems like in order to build this application, to do this audio processing, you would need to write a ton of code, or maybe link in a bunch of synthesizer Java libraries that process audio; it seems like a really complicated program to separate out audio and so on.

It turns out that the algorithm to do what you just heard can be done with one line of code - shown right here.

[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');

It took researchers a long time to come up with this line of code. I'm not saying this is an easy problem, but it turns out that when you use the right programming environment, many learning algorithms can be really short programs.

So this is also why in this class we're going to use the Octave programming environment.

Octave is free, open-source software, and using a tool like Octave or Matlab, many learning algorithms become just a few lines of code to implement. Later in this class, I'll teach you a little bit about how to use Octave and you'll be implementing some of these algorithms in Octave. Or if you have Matlab you can use that too.

It turns out that in Silicon Valley, for a lot of machine learning algorithms, what we do is first prototype our software in Octave, because Octave makes it incredibly fast to implement these learning algorithms.

Here, each of these functions, for example the SVD function (which stands for singular value decomposition), is a linear algebra routine that is just built into Octave.
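As a minimal sketch (not from the lecture; the matrix A here is just an arbitrary example), calling that built-in routine in Octave looks like this:

A = [1 2; 3 4; 5 6];   % an arbitrary example matrix
[U, S, V] = svd(A);    % singular value decomposition: A is reconstructed by U * S * V'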

If you were trying to do this in C++ or Java, this would be many, many lines of code linking complex C++ or Java libraries. So, you can implement this stuff in C++ or Java or Python, it's just much more complicated to do so in those languages.

What I've seen after having taught machine learning for almost a decade now, is that, you learn much faster if you use Octave as your programming environment, and if you use Octave as your learning tool and as your prototyping tool, it'll let you learn and prototype learning algorithms much more quickly.

And in fact, what many people at the large Silicon Valley companies will do is use an environment like Octave to first prototype the learning algorithm, and only after you've gotten it to work do you migrate it to C++ or Java or whatever. It turns out that by doing things this way, you can often get your algorithm to work much faster than if you were starting out in C++.

So, I know that as an instructor, I get to say "trust me on this one" only a finite number of times, but for those of you who've never used these Octave-type programming environments before, I am going to ask you to trust me on this one, and say that I think your development time is one of the most valuable resources.

And having seen lots of people do this, I think you as a machine learning researcher, or machine learning developer, will be much more productive if you learn to prototype in Octave first, rather than in some other language.

Finally, to wrap up this video, I have one quick review question for you.

We talked about Unsupervised Learning, which is a learning setting where you give the algorithm a ton of data and just ask it to find structure in the data for us.

For example:

  • Given a set of news articles found on the web, group them into sets of articles about the same stories.
  • Given a database of customer data, automatically discover market segments and group customers into different market segments.

So, hopefully, you've remembered the spam folder problem. If you have labeled data, you know, with spam and non-spam e-mail, we'd treat this as a Supervised Learning problem.

The news story example, that's exactly the Google News example that we saw in this video, we saw how you can use a clustering algorithm to cluster these articles together so that's Unsupervised Learning.

The market segmentation example I talked about a little bit earlier, you can do that as an Unsupervised Learning problem, because I am just going to give my algorithm data and ask it to discover market segments automatically.

And the final example, diabetes, well, that's actually just like our breast cancer example from the last video. Only instead of good and bad cancer tumors, or benign or malignant tumors, we instead have diabetes or not, and so we will solve that as a Supervised Learning problem just like we did for the breast tumor data.

So, that's it for Unsupervised Learning and in the next video, we'll delve more into specific learning algorithms and start to talk about just how these algorithms work and how we can, how you can go about implementing them.

Unsupervised Learning

Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.

We can derive this structure by clustering the data based on relationships among the variables in the data.

With unsupervised learning there is no feedback based on the prediction results.

Example:

Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on.

Non-clustering: The "Cocktail Party Algorithm", allows you to find structure in a chaotic environment. (i.e. identifying individual voices and music from a mesh of sounds at a cocktail party).

1. A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. Suppose we feed a learning algorithm a lot of historical weather data, and have it learn to predict weather. What would be a reasonable choice for P?

The probability of it correctly predicting a future date's weather.

2. Suppose you are working on weather prediction, and your weather station makes one of three predictions for each day's weather: Sunny, Cloudy or Rainy. You'd like to use a learning algorithm to predict tomorrow's weather. Would you treat this as a classification or a regression problem?

Classification

3. Suppose you are working on stock market prediction, and you would like to predict the price of a particular stock tomorrow (measured in dollars). You want to use a learning algorithm for this. Would you treat this as a classification or a regression problem?

Regression

4. (NOT RIGHT) Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from.

Examine a large collection of emails that are known to be spam email, to discover if there are sub-types of spam mail.

Given 50 articles written by male authors, and 50 articles written by female authors, learn to predict the gender of a new manuscript's author (when the identity of this author is unknown).

Given historical data of children's ages and heights, predict children's height as a function of their age.

5. Which of these is a reasonable definition of machine learning?

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.

Model and Cost Function

Linear regression.

Consider a data set of housing prices from the city of Portland, Oregon: we plot a number of houses of different sizes that were sold for a range of different prices.

Let's say that given this data set, you have a friend who is trying to sell a house; let's say your friend's house is 1250 square feet, and you want to tell them how much they might be able to sell the house for.

Well, one thing you could do is fit a model. Maybe fit a straight line to this data; it looks something like that, and based on that, maybe you could tell your friend that he can sell the house for around $220,000.

So this is an example of a supervised learning algorithm. And it's supervised learning because we're given the, quotes, "right answer" for each of our examples. Namely, we're told the actual price that each of the houses in our data set was sold for. Moreover, this is an example of a regression problem, where the term regression refers to the fact that we are predicting a real-valued output, namely the price.

And just to remind you the other most common type of supervised learning problem is called the classification problem where we predict discrete-valued outputs such as if we are looking at cancer tumors and trying to decide if a tumor is malignant or benign. So that's a zero-one valued discrete output.

More formally, in supervised learning, we have a data set and this data set is called a training set. So for housing prices example, we have a training set of different housing prices and our job is to learn from this data how to predict prices of the houses.

Let's define some notation that we're using throughout this course. We're going to define quite a lot of symbols.

  • m = Number of training examples
  • x's = 'input' variable/feature
  • y's = 'output' variable/feature
  • (x,y) = one training example
  • $(x^{(i)}, y^{(i)})$ = ith training example (i here is not exponentiation! It is an index)

In this data set, if I have, let's say, 47 rows in this table, then I have 47 training examples and m equals 47. Let me use lowercase x to denote the input variables, often also called the features; that would be the first column here, the input features. And I'm going to use y to denote my output variable, or the target variable, which I'm going to predict; that's the second column here.

We use (x, y) to denote a single training example. So, a single row in this table (e.g. 2104, 460) corresponds to a single training example. To refer to a specific training example, we write $(x^{(i)}, y^{(i)})$ for the ith training example.

So for example:

$x^{(1)} = 2104$

$x^{(2)} = 1416$

$y^{(1)} = 460$
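As a quick illustration (a sketch, not from the notes; the rows beyond the values quoted above are assumed placeholders), the same notation in Octave:

x = [2104; 1416; 1534; 852];   % input variable / feature: house size in square feet
y = [460; 232; 315; 178];      % output / target variable: price in $1000s (values after y(1) are placeholders)
m = length(x);                 % number of training examples, here m = 4
x(1)                           % x^(1) = 2104
y(1)                           % y^(1) = 460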

Here's how this supervised learning algorithm works. We have a training set, like our training set of housing prices, and we feed that to our learning algorithm. It is the job of the learning algorithm to then output a function, which by convention is usually denoted lowercase h, where h stands for hypothesis.

The hypothesis is a function that takes as input the size of a house, like maybe the size of the new house your friend is trying to sell; it takes in the value of x and it tries to output the estimated value of y for the corresponding house.

So h is a function that maps from x's to y's.

In machine learning, hypothesis is a name that was used in the early days of machine learning and it kind of stuck. It's maybe not a great name for this sort of function, for mapping from sizes of houses to predictions, but this is the standard terminology that people use in machine learning, so don't worry too much about why people call it that. When designing a learning algorithm, the next thing we need to decide is how to represent this hypothesis h.

Our initial choice for representing the hypothesis will be the following. We're going to represent h as follows.

And we will write this as

$h_\theta(x) = \theta_0 + \theta_1x$

And as a shorthand, sometimes instead of writing $h_\theta(x)$, we will just write $h(x)$.
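As a small sketch (the parameter values here are made up, not from the lecture), the hypothesis can be written as an Octave function handle:

theta0 = 50; theta1 = 0.1;      % assumed example parameter values
h = @(x) theta0 + theta1 * x;   % h_theta(x) = theta0 + theta1 * x
h(1250)                         % predicted price (in $1000s) for a 1250 sq ft house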

Cost Function - Intuition I

In linear regression, we have a training set like the one I showed here; remember, in our notation m is the number of training examples, so maybe m equals 47. And the form of our hypothesis, which we use to make predictions, is this linear function.

To introduce a little bit more terminology, theta zero and theta one are what I call the parameters of the model. And what we're going to do in this video is talk about how to go about choosing these two parameter values, theta 0 and theta 1.

With different choices of the parameters theta 0 and theta 1, we get different hypotheses, different hypothesis functions. I know some of you will probably already be familiar with what I am going to do on the slide, but just for review, here are a few examples. If theta 0 is 1.5 and theta 1 is 0, then the hypothesis function will look like this.

Because your hypothesis function will be h of x equals 1.5 plus 0 times x, which is a constant-valued function that is flat at 1.5. If theta0 = 0, theta1 = 0.5, then the hypothesis will look like this, and it should pass through the point (2, 1), so that you now have h(x). Or really h sub theta of x, but sometimes I'll just omit theta for brevity. So h(x) will be equal to just 0.5 times x, which looks like that. And finally, if theta zero equals one, and theta one equals 0.5, then we end up with a hypothesis that looks like this; let's see, it should pass through the point (2, 2).

In linear regression, we have a training set, like maybe the one I've plotted here. What we want to do is come up with values for the parameters theta zero and theta one so that the straight line we get out of this somehow fits the data well, like maybe that line over there. So, how do we come up with values of theta zero and theta one that correspond to a good fit to the data?

The idea is we get to choose our parameters theta 0, theta 1 so that h of x, meaning the value we predict on input x, that this is at least close to the values y for the examples in our training set, for our training examples.

So in our training set, we're given a number of examples where we know the size X of the house and we know the actual price it was sold for. So, let's try to choose values for the parameters so that, at least in the training set, given the X values in the training set, we make reasonably accurate predictions for the Y values. Let's formalize this.

So in linear regression, what we're going to do is solve a minimization problem. So I'll write minimize over theta0, theta1. And I want this to be small: I want the difference between h(x) and y to be small.

And one thing I might do is try to minimize the square difference between the output of the hypothesis and the actual price of a house. Okay, so let's fill in some details.

You remember that I was using the notation (x(i), y(i)) to represent the ith training example. So what I really want is to sum over my training set, from i = 1 to m, the square difference between the prediction of my hypothesis when it is given as input the size of house number i, and the actual price that house number i was sold for. I want to minimize this sum over my training set, from i equals 1 through m, of this squared error, the square difference between the predicted price of a house and the price that it was actually sold for.

And just to remind you of the notation, m here is the size of my training set, the number of training examples. And to make the math a little bit easier, I'm actually going to look at 1 over m times that, so let's try to minimize the average; in fact, minimize 1 over 2m times it. Putting the constant one half in front just makes the math a little easier; minimizing one half of something should give you the same values of theta 0 and theta 1 as minimizing the original function.

$J(\theta_0,\theta_1) = \frac{1}{2m} \sum_{i=1}^{m}(ŷ_i - y_i)^2 = \frac{1}{2m} \sum_{i=1}^{m}(h_\theta(x_i) - y_i)^2$

$h_\theta(x) = \theta_0 + \theta_1x$

Here $h_\theta(x_i)$ is equal to $\theta_0 + \theta_1x_i$. And this notation, minimize over theta 0, theta 1, means: find me the values of theta 0 and theta 1 that cause this expression to be minimized, and this expression depends on theta 0 and theta 1, okay?

So just to recap: we're posing this problem as, find me the values of theta zero and theta one so that the average, the 1 over 2m, times the sum of squared errors between my predictions on the training set and the actual values of the houses in the training set, is minimized. So this is going to be my overall objective function for linear regression.

Cost function = $J(\theta_0,\theta_1)$

And what I want to do is minimize, over theta0 and theta1, my function J(theta0, theta1); let's just write this out. This is my cost function. This cost function is also called the squared error function.

It is sometimes called the squared error cost function. Why do we take the squares of the errors? It turns out that this squared error cost function is a reasonable choice and works well for most regression problems. There are other cost functions that will work pretty well, but the squared error cost function is probably the most commonly used one for regression problems.
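As a minimal sketch (not the course's official code; computeCost is a hypothetical helper name, saved for example as computeCost.m), the cost function can be computed in Octave as:

function J = computeCost(x, y, theta0, theta1)
  m = length(y);                                    % number of training examples
  predictions = theta0 + theta1 * x;                % h_theta(x) for every example
  J = (1 / (2 * m)) * sum((predictions - y) .^ 2);  % squared error cost
end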

So far we've just seen a mathematical definition of this cost function. In case this function J of theta zero, theta one seems a little bit abstract, and you still don't have a good sense of what it's doing, in the next couple of videos I'm going to go a little bit deeper into what the cost function J is doing and try to give you better intuition about what it is computing and why we want to use it.

To recap, here's what we had last time. We want to fit a straight line to our data, so we have this form for the hypothesis with these parameters theta zero and theta one, and with different choices of the parameters we end up with different straight lines:

In pictures, what this means is that if theta zero equals zero, that corresponds to choosing only hypothesis functions that pass through the origin, that pass through the point (0, 0). Using this simplified definition of the hypothesis and cost function, let's try to understand the cost function concept better.

It turns out there are two key functions we want to understand. The first is the hypothesis function, and the second is the cost function. So, notice that the hypothesis, h of x, for a fixed value of theta one, is a function of x. So the hypothesis is a function of the size of the house, x. In contrast, the cost function, J, is a function of the parameter theta one, which controls the slope of the straight line. Let's plot these functions and try to understand them both better. Let's start with the hypothesis. On the left, let's say here's my training set with three points at (1, 1), (2, 2), and (3, 3). Let's pick a value for theta one, say theta one equals one, and if that's my choice for theta one, then my hypothesis is going to look like this straight line over here. And I'll point out that when I'm plotting my hypothesis function, my horizontal axis is labeled x, the size of the house. Now, temporarily, set theta one equals one; what I want to do is figure out what J(theta one) is when theta one equals one.

So let's go ahead and compute what the cost function equals for these values of theta one. When theta one equals one, the hypothesis fits all three points exactly, so $J(1) = 0$. For theta one equal to 0.5:

$J(0.5) \approx 0.58$

What does this mean?

For $\theta_1 = 0$:

$J(0) = \frac{1}{2m}[(1 - 0)^2 + (2 - 0)^2 + (3 - 0)^2] = \frac{1}{6}[1 + 4 + 9] = \frac{14}{6} \approx 2.33$
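Using the hypothetical computeCost sketch from earlier with the three training points (1, 1), (2, 2), (3, 3), these values can be checked numerically:

x = [1; 2; 3];  y = [1; 2; 3];
computeCost(x, y, 0, 1)     % J(1)   = 0, the line fits the data perfectly
computeCost(x, y, 0, 0.5)   % J(0.5) is approximately 0.58
computeCost(x, y, 0, 0)     % J(0)   = 14/6, approximately 2.33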

Each value of theta one corresponds to a different hypothesis, or to a different straight-line fit on the left. And for each value of theta one, we can then derive a different value of J(theta one).

  • $\theta_1 = 1$ (see the graph on the right) corresponds to the light blue line.
  • $\theta_1 = 0.5$ (see the graph on the right) corresponds to the magenta line.
  • $\theta_1 = 0$ (see the graph on the right) corresponds to the dark blue line.

We want to choose the value of theta one that minimizes J of theta one. This was our objective function for linear regression. Looking at this curve, the value that minimizes J of theta one is theta one equal to one.

Thus as a goal, we should try to minimize the cost function. In this case, θ1=1 is our global minimum.

And lo and behold, that is indeed the best possible straight-line fit through our data, obtained by setting theta one equal to one. And, for this particular training set, we actually end up fitting it perfectly. And that's why minimizing J of theta one corresponds to finding a straight line that fits the data well.

So, to wrap up: in this video, we looked at some plots to understand the cost function. To do so, we simplified the algorithm so that it only had one parameter, theta one, and we set the parameter theta zero to zero. In the next video, we'll go back to the original problem formulation and look at some visualizations involving both theta zero and theta one, that is, without setting theta zero to zero. And hopefully that will give you an even better sense of what the cost function J is doing in the original linear regression formulation.

Cost Function - Intuition II

This video assumes that you're familiar with contour plots.

Here's our problem formulation as usual, with the hypothesis parameters, cost function, and our optimization objective.

Hypothesis: $h_\theta(x) = \theta_0 + \theta_1x$

Parameters: $\theta_0, \theta_1$

Cost Function: $J(\theta_0,\theta_1) = \frac{1}{2m} \sum_{i=1}^{m}(ŷ_i - y_i)^2 = \frac{1}{2m} \sum_{i=1}^{m}(h_\theta(x_i) - y_i)^2$

Goal: $\begin{matrix}minimize\\\theta_0, \theta_1\end{matrix}J(\theta_0,\theta_1)$

I'm going to keep both of my parameters, theta zero, and theta one, as we generate our visualizations for the cost function.

So, same as last time, we want to understand the hypothesis H and the cost function J.

So, here's my training set of housing prices and let's make some hypothesis.

This is not a particularly good hypothesis.

But, if I set theta zero=50 and theta one=0.06, then I end up with this hypothesis down here and that corresponds to that straight line.

Now given these value of theta zero and theta one, we want to plot the corresponding cost function on the right.

What we did last time, when we only had theta one, was to plot the cost function as a function of theta one.

In other words, drawing plots that look like this as a function of theta one.

But now we have two parameters, theta zero, and theta one, and so the plot gets a little more complicated.

It turns out that when we had only one parameter, the plots we drew had this sort of bow-shaped form.

Now, when we have two parameters, it turns out the cost function also has a similar sort of bow shape. And, in fact, depending on your training set, you might get a cost function that maybe looks something like this.

So, this is a 3-D surface plot, where the axes are labeled theta zero and theta one.

So as you vary theta zero and theta one, the two parameters, you get different values of the cost function J(theta zero, theta one), and the height of this surface above a particular point (theta zero, theta one) is the value of the cost function there.

Right, that's the vertical axis: the height of the surface at a point indicates the value of $J(\theta_0,\theta_1)$. And you can see it sort of has this bow-like shape.

For the purpose of illustration in the rest of this video I'm not actually going to use these sort of 3D surfaces to show you the cost function J, instead I'm going to use contour plots:

Or what I also call contour figures; I guess they mean the same thing. I'll use these to show you these surfaces.

So here's an example of a contour figure, shown on the right, where the axes are theta zero and theta one. And what each of these ovals, each of these ellipses, shows is a set of points that take on the same value for J(theta zero, theta one). So concretely, for example, take that point and that point and that point. All three of these points that I just drew in magenta have the same value for $J(\theta_0,\theta_1)$.

These are the theta zero, theta one axes, but those three points have the same value for $J(\theta_0,\theta_1)$, and if you haven't seen contour plots much before, imagine, if you will, a bow-shaped function that's coming out of my screen. (CONTOUR PLOTS! ;D)

So the minimum, the bottom of the bow, is this point right there, right? The middle of these concentric ellipses. And imagine a bow shape that sort of grows out of my screen like this, so that each of these ellipses has the same height above my screen, and the minimum of the bow is right down there. And so the contour figure is maybe a more convenient way to visualize my function J.

So, let's look at some examples. Over here, I have a particular point, right? And this is with, you know, theta zero equal to maybe about 800, and theta one equal to maybe about -0.15. And so this point corresponds to one pair of values of theta zero, theta one and, in fact, to a particular hypothesis: theta zero is about 800, that is, where it intersects the vertical axis is around 800, and the slope is about -0.15.

Now this line is really not such a good fit to the data, right? This hypothesis, h(x), with these values of theta zero, theta one, is really not such a good fit to the data. And so you find that its cost is a value that's out here, pretty far from the minimum; it's a pretty high cost, because this is just not that good a fit to the data.

Let's look at some more examples. Now here's a different hypothesis that's still not a great fit for the data but may be slightly better. So here, right, that's my point, those are my parameters theta zero, theta one, and my theta zero value is about 360 and my value for theta one is equal to zero. Let's take:

  • $\theta_0 = 360$
  • $\theta_1 = 0$

And this pair of parameters corresponds to that hypothesis, a flat line, that is,

$h(x) = 360 + 0 \cdot x$

So that's the hypothesis. And this hypothesis again has some cost, and that cost is, you know, plotted as the height of the J function at that point.

Let's look at just a couple of examples.

Here's one more, you know, at this value of theta zero, and at that value of theta one, we end up with this hypothesis, h(x) and again, not a great fit to the data, and is actually further away from the minimum.

Last example: this is actually not quite at the minimum, but it's pretty close to the minimum, so this is not such a bad fit to the data. For a particular value of theta zero and a particular value of theta one, we get a particular h(x). And this is not quite at the minimum, but it's pretty close. And the sum of squared errors is the sum of squared distances between my training samples and my hypothesis; really, it's the sum of the squares of all of these errors. This is pretty close to the minimum even though it's not quite the minimum.

So with these figures I hope that gives you a better understanding of what the values of the cost function J are, how they correspond to different hypotheses, and how better hypotheses correspond to points that are closer to the minimum of this cost function J. Now, of course, what we really want is an efficient algorithm, an efficient piece of software, for automatically finding the values of theta zero and theta one that minimize the cost function J, right?

And what we don't want to do is write software to plot out this figure and then try to manually read off the numbers; that is not a good way to do it. And, in fact, we'll see later that when we look at more complicated examples, we'll have higher-dimensional problems with more parameters, and it turns out, as we'll see later in this course, there are examples where this figure cannot really be plotted and this becomes much harder to visualize. And so, what we want is software to find the values of theta zero, theta one that minimize this function, and in the next video we start to talk about an algorithm for automatically finding the values of theta zero and theta one that minimize the cost function J.

Supplement Material

A contour plot is a graph that contains many contour lines. A contour line of a two variable function has a constant value at all points of the same line. An example of such a graph is the one to the right below.

Taking any color and going along the 'circle', one would expect to get the same value of the cost function. For example, the three green points found on the green line above have the same value for $J(\theta_0, \theta_1)$ and as a result, they are found along the same line. The circled x displays the value of the cost function for the graph on the left when $\theta_0$ = 800 and $\theta_1$= -0.15. Taking another h(x) and plotting its contour plot, one gets the following graphs:

When $\theta_0$ = 360 and $\theta_1$ = 0, the value of $J(\theta_0, \theta_1)$ in the contour plot gets closer to the center thus reducing the cost function error. Now giving our hypothesis function a slightly positive slope results in a better fit of the data.

The graph above minimizes the cost function as much as possible and consequently, the result of $\theta_1$ and $\theta_0$ tend to be around 0.12 and 250 respectively. Plotting those values on our graph to the right seems to put our point in the center of the inner most 'circle'.
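A small sketch of how such a contour figure can be produced in Octave (assuming the hypothetical computeCost helper from earlier and some training data x, y; the grid ranges are arbitrary, not from the notes):

theta0_vals = linspace(-10, 10, 100);   % grid of candidate theta0 values
theta1_vals = linspace(-1, 4, 100);     % grid of candidate theta1 values
J_vals = zeros(length(theta0_vals), length(theta1_vals));
for i = 1:length(theta0_vals)
  for j = 1:length(theta1_vals)
    J_vals(i, j) = computeCost(x, y, theta0_vals(i), theta1_vals(j));
  end
end
% transpose J_vals so theta1 varies along the vertical axis of the plot
contour(theta0_vals, theta1_vals, J_vals', logspace(-2, 3, 20));
xlabel('theta_0'); ylabel('theta_1');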

Gradient Descent

We previously defined the cost function J. In this video, I want to tell you about an algorithm called gradient descent for minimizing the cost function J. It turns out gradient descent is a more general algorithm, and is used not only in linear regression. It's actually used all over the place in machine learning. And later in the class, we'll use gradient descent to minimize other functions as well, not just the cost function J for the linear regression.

So in this video, we'll talk about gradient descent for minimizing some arbitrary function J and then in later videos, we'll take this algorithm and apply it specifically to the cost function J that we have defined for linear regression.

So here's the problem setup.

Have some function $J(\theta_0,\theta_1)$

Want $\begin{matrix}min\\\theta_0, \theta_1\end{matrix}J(\theta_0,\theta_1)$

Outline:

  • Start with some $\theta_0,\theta_1$
  • Keep changing $\theta_0,\theta_1$ to reduce $J(\theta_0,\theta_1)$ until we hopefully end up at a minimum.

We're going to assume that we have some function J(theta 0, theta 1); maybe it's the cost function from linear regression, maybe it's some other function we want to minimize. And we want to come up with an algorithm for minimizing J as a function of theta 0 and theta 1.

Just as an aside, it turns out that gradient descent actually applies to more general functions. So imagine you have a function J of theta 0, theta 1, theta 2, up to say some theta n, and you want to minimize over theta 0 up to theta n:

$J(\theta_0,\theta_1,\theta_2,...\theta_n)$

You minimize over theta 0 up to theta n of this J of theta 0 up to theta n:

$\begin{matrix}min\\\theta_0,...,\theta_n\end{matrix}J(\theta_0,...,\theta_n)$

And it turns out gradient descent is an algorithm for solving this more general problem. But for the sake of brevity, for the sake of succinctness of notation, I'm just going to pretend I have only two parameters throughout the rest of this video. Here's the idea for gradient descent.

What we're going to do is we're going to start off with some initial guesses for theta 0 and theta 1.

Doesn't really matter what they are, but a common choice would be we set:

  • $\theta_0 = 0$
  • $\theta_1 = 0$

just initialize them to 0.

What we're going to do in gradient descent is we'll keep changing theta 0 and theta 1 a little bit to try to reduce J(theta 0, theta 1), until hopefully, we wind up at a minimum, or maybe at a local minimum.

So let's see in pictures what gradient descent does. Let's say you're trying to minimize this function.

So notice the axes, this is theta 0, theta 1 on the horizontal axes and J is the vertical axis and so the height of the surface shows J and we want to minimize this function. So we're going to start off with theta 0, theta 1 at some point. So imagine picking some value for theta 0, theta 1, and that corresponds to starting at some point on the surface of this function.

So whatever value of theta 0, theta 1 you pick gives you some point here. Here I initialized them to 0, 0, but sometimes you initialize them to other values as well.

Now, I want you to imagine that this figure shows a hill. Imagine this is like the landscape of some grassy park, with two hills like so, and I want you to imagine that you are physically standing at that point on the hill, on this little red hill in your park. In gradient descent, what we're going to do is spin 360 degrees around, just look all around us, and ask: if I were to take a little baby step in some direction, and I want to go downhill as quickly as possible, what direction do I take that little baby step in? I want to physically walk down this hill as rapidly as possible.

Turns out, if you're standing at that point on the hill, you look all around and you find that the best direction to take a little step downhill is roughly that direction. Okay, and now you're at this new point on your hill. You're gonna, again, look all around and say what direction should I step in order to take a little baby step downhill? And if you do that and take another step, you take a step in that direction.

And then you keep going. From this new point you look around, decide what direction would take you downhill most quickly. Take another step, another step, and so on until you converge to this local minimum down here.

Gradient descent has an interesting property.

This first time we ran gradient descent we were starting at this point over here, right? Started at that point over here. Now imagine we had initialized gradient descent just a couple steps to the right. Imagine we'd initialized gradient descent with that point on the upper right. If you were to repeat this process, so start from that point, look all around, take a little step in the direction of steepest descent, you would do that. Then look around, take another step, and so on.

And if you started just a couple of steps to the right, gradient descent would've taken you to this second local optimum over on the right.

So if you had started this first point, you would've wound up at this local optimum, but if you started just at a slightly different location, you would've wound up at a very different local optimum. And this is a property of gradient descent that we'll say a little bit more about later.

So that's the intuition in pictures. Let's look at the math. This is the definition of the gradient descent algorithm.

We're going to just repeatedly do this until convergence: we update the parameter theta j by taking theta j and subtracting from it alpha times this derivative term, okay? So let's see, there are a lot of details in this equation, so let me unpack some of it. First, this notation here, :=, we use := to denote assignment, so it's the assignment operator. So briefly, if I write a := b, what this means is, in a computer, take the value in b and use it to overwrite whatever value a has. So this means set a to be equal to the value of b, which is assignment. And I can also do a := a + 1. This means take a and increase its value by one. Whereas, in contrast, if I use the equal sign and write a = b, then this is a truth assertion.

So if I write a = b, then I'm asserting that the value of a equals the value of b. So the left-hand side, that's the computer operation, where we set the value of a to a new value. The right-hand side is asserting, I'm just making a claim, that the values of a and b are the same, and so whereas you can write a := a + 1, which means increment a by 1, hopefully I won't ever write a = a + 1, because that's just wrong: a and a + 1 can never be equal. Okay? So this is the first part of the definition.

This alpha here is a number that is called the learning rate.

And what alpha does is it basically controls how big a step we take downhill with gradient descent. So if alpha is very large, then that corresponds to a very aggressive gradient descent procedure where we're trying to take huge steps downhill, and if alpha is very small, then we're taking little, little baby steps downhill. And I'll come back and say more about this later, about how to set alpha and so on.

And finally, this term here, that's a derivative term. I don't wanna talk about it right now, but I will derive this derivative term and tell you exactly what this is later, okay? And some of you will be more familiar with calculus than others, but even if you aren't familiar with calculus, don't worry about it. I'll tell you what you need to know about this term here.

Pause for coffee.

You know, this is - excuse me - a damn fine cup of coffee. I've had I can't tell you how many cups of coffee in my life and this, this is one of the best.

Now, there's one more subtlety about gradient descent which is in gradient descent we're going to update, you know, theta 0 and theta 1, right?

So this update takes place for j = 0 and j = 1, so you're gonna update theta 0 and update theta 1.

And the subtlety of how you implement gradient descent is for this expression, for this update equation, you want to simultaneously update theta 0 and theta 1.

What I mean by that is that in this equation, we're gonna update

  • theta 0 := theta 0 minus something

and update

  • theta 1 := theta 1 minus something.

And the way to implement it is you should compute the right-hand sides, right? Compute that thing for theta 0 and theta 1, and then simultaneously, at the same time, update theta 0 and theta 1, okay? So let me say what I mean by that. This is a correct implementation of gradient descent, meaning simultaneous update: I'm going to set temp0 equal to that, set temp1 equal to that, so basically compute the right-hand sides, and then, having computed the right-hand sides and stored them into variables temp0 and temp1, I'm going to update theta 0 and theta 1 simultaneously, because that's the correct implementation.
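In Octave-like pseudocode (a sketch; dJ_dtheta0 and dJ_dtheta1 are hypothetical helpers standing for the two derivative terms), the correct simultaneous update looks like this:

temp0  = theta0 - alpha * dJ_dtheta0(theta0, theta1);  % both right-hand sides use the old values
temp1  = theta1 - alpha * dJ_dtheta1(theta0, theta1);
theta0 = temp0;                                        % then both parameters are overwritten together
theta1 = temp1;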

In contrast, here's an incorrect implementation that does not do a simultaneous update. In this incorrect implementation, we compute temp0, and then we update theta 0, and then we compute temp1, and then we update theta 1.
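For contrast, a sketch of that incorrect non-simultaneous version, using the same hypothetical derivative helpers:

temp0  = theta0 - alpha * dJ_dtheta0(theta0, theta1);
theta0 = temp0;                                        % theta0 is overwritten too early...
temp1  = theta1 - alpha * dJ_dtheta1(theta0, theta1);  % ...so this derivative term sees the new theta0
theta1 = temp1;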

!!!!

And the difference between the two implementations is that, if you look at the second one, by the time you compute temp1 you've already updated theta 0, so you would be using the new value of theta 0 to compute the derivative term. And so this gives you a different value of temp1 than the correct implementation, right? Because you've now plugged in the new value of theta 0 into this equation. And so, the non-simultaneous version is not a correct implementation of gradient descent, okay? I don't want to go into why you need to do the simultaneous updates; it turns out that the way gradient descent is usually implemented, which I'll say more about later, it actually turns out to be more natural to implement the simultaneous update. And when people talk about gradient descent, they always mean simultaneous update.

If you implement the non-simultaneous update, it turns out it will probably work anyway. But that algorithm isn't quite right; it's not what people refer to as gradient descent, it is some other algorithm with different properties. And for various reasons it can behave in slightly stranger ways, so what you should really do is implement the simultaneous update of gradient descent.

So, that's the outline of the gradient descent algorithm. In the next video, we're going to go into the details of the derivative term, which I wrote up but didn't really define. And if you've taken a calculus class before and if you're familiar with partial derivatives and derivatives, it turns out that's exactly what that derivative term is, but in case you aren't familiar with calculus, don't worry about it.

The next video will give you all the intuitions and will tell you everything you need to know to compute that derivative term, even if you haven't seen calculus, or even if you haven't seen partial derivatives before. And with that, with the next video, hopefully we'll be able to give you all the intuitions you need to apply gradient descent.

Supplement Material - Gradient Descent

So we have our hypothesis function and we have a way of measuring how well it fits into the data. Now we need to estimate the parameters in the hypothesis function. That's where gradient descent comes in.

Imagine that we graph our hypothesis function based on its parameters θ0 and θ1 (actually we are graphing the cost function as a function of the parameter estimates). We are not graphing x and y itself, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.

We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts such a setup.

We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. The red arrows show the minimum points in the graph.

The way we do this is by taking the derivative (the slope of the tangent line to the function) of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost function in the direction of steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.

For example, the distance between each 'star' in the graph above represents a step determined by our parameter α. A smaller α would result in a smaller step and a larger α results in a larger step. The direction in which the step is taken is determined by the partial derivative of J(θ0,θ1). Depending on where one starts on the graph, one could end up at different points. The image above shows us two different starting points that end up in two different places.

The gradient descent algorithm is:

repeat until convergence:

$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$

where

j=0,1 represents the feature index number.

At each iteration j, one should simultaneously update the parameters θ1, θ2, ..., θn. Updating a specific parameter prior to calculating another one on the j(th) iteration would yield a wrong implementation.
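A minimal sketch of the full loop in Octave (not the course's implementation; the analytic derivative for linear regression is derived later, so a small numerical approximation is used here just for illustration, and J is assumed to be a function handle such as J = @(t0, t1) computeCost(x, y, t0, t1)):

alpha = 0.01;  delta = 1e-6;
theta0 = 0;  theta1 = 0;            % a common choice: initialize both parameters to 0
for iter = 1:1500                   % "repeat until convergence" (a fixed iteration count here)
  grad0 = (J(theta0 + delta, theta1) - J(theta0 - delta, theta1)) / (2 * delta);
  grad1 = (J(theta0, theta1 + delta) - J(theta0, theta1 - delta)) / (2 * delta);
  temp0 = theta0 - alpha * grad0;   % simultaneous update: compute both right-hand sides first,
  temp1 = theta1 - alpha * grad1;   % then assign both
  theta0 = temp0;
  theta1 = temp1;
end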

Gradient Descent Intuition

In the previous video, we gave a mathematical definition of gradient descent. Let's delve deeper and in this video get better intuition about what the algorithm is doing and why the steps of the gradient descent algorithm might make sense.

Here's the gradient descent algorithm that we saw last time, and just to remind you, this parameter, this term alpha, is called the learning rate. And it controls how big a step we take when updating the parameter theta j.

$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$

And this second term here is the derivative term.

And what I want to do in this video is give you the intuition about what each of these two terms is doing and why, when put together, this entire update makes sense. In order to convey these intuitions, what I want to do is use a slightly simpler example, where we want to minimize a function of just one parameter. So say we have a cost function J of just one parameter, theta one, like we did a few videos back, where theta one is a real number. So we can have 1-D plots, which are a little bit simpler to look at. Let's try to understand what gradient descent would do on this function.

So let's say here's my function, J of theta 1, where theta 1 is a real number.

Now, let's say we initialize gradient descent with theta one at this location. So imagine that we start off at that point on my function.

What gradient descent will do is the update:

Theta one gets updated as theta one minus alpha times d/d theta one of J of theta one. And as an aside, about this derivative term: if you're wondering why I changed the notation from the partial derivative symbol, or if you don't know what the difference is between the partial derivative symbol and d/d theta, don't worry about it. Technically in mathematics you call one a partial derivative and the other a derivative, depending on the number of parameters in the function J, but that's a mathematical technicality. For the purpose of this lecture, think of the partial symbol and d/d theta 1 as exactly the same thing, and don't worry about the real difference. I'm going to try to use the mathematically precise notation, but for our purposes these two notations are really the same thing. So let's see what this equation will do. We're going to compute this derivative; I'm not sure if you've seen derivatives in calculus before, but what the derivative at this point does is basically say: take the tangent to that point, that straight line, the red line that is just touching this function, and look at the slope of this red line. That's what the derivative is: it's the slope of the line that is just tangent to the function, and the slope of a line is just the height divided by the horizontal distance. Now, this line has a positive slope, so it has a positive derivative. And so my update to theta is going to be: theta 1 gets updated as theta 1 minus alpha times some positive number.

Alpha, the learning rate, is always a positive number.

And so we're going to take theta one and update it as theta one minus something, so I'm going to end up moving theta one to the left. I'm going to decrease theta one, and we can see this is the right thing to do, because I actually want to head in this direction to get closer to the minimum over there.

So, gradient descent so far says we're doing the right thing. Let's look at another example. Let's take the same function J of theta 1, and now let's say I initialize my parameter over there on the left. So theta 1 is here, and I'm looking at that point on the curve.

Now my derivative term, d/d theta one of J of theta one, when evaluated at this point, is the slope of that line. But this line is slanting down, so it has negative slope.

Or alternatively, I say that this function has a negative derivative, which just means a negative slope at that point. So this is less than or equal to 0, so when I update theta, I update theta one as theta one minus alpha times a negative number.

And so I have theta 1 minus a negative number, which means I'm actually going to increase theta, because subtracting a negative number means adding something to theta. And what that means is that I'm going to end up increasing theta, moving it to the right, which again seems like the thing I wanted to do to try to get closer to the minimum.

So that's the intuition behind what the derivative term is doing. Now let's take a look at the learning rate term alpha and see what that's doing.

So here's my gradient descent update rule, that's this equation.

And let's look at what could happen if alpha is either too small or too large. So in this first example, what happens if alpha is too small? Here's my function J of theta; let's say I start here. If alpha is too small, then I'm going to multiply my update by some small number, so I end up taking a baby step like that. Okay, so that's one step. Then from this new point, I take another step; but if alpha is too small, I take another little baby step. And so if my learning rate is too small, I'm going to end up taking these tiny, tiny baby steps as I try to get to the minimum, and I'm going to need a lot of steps to get there. So if alpha is too small, gradient descent can be slow, because it's going to take these tiny, tiny baby steps and it's going to need a lot of steps before it gets anywhere close to the global minimum.

Now how about if alpha is too large? So, here's my function J. If alpha is too large, then gradient descent can overshoot the minimum and may even fail to converge, or even diverge. Here's what I mean: let's say we start at a point that's actually close to the minimum. The derivative points to the right, but if alpha is too big, I take a huge step. So it ends up taking a huge step, and now my cost has actually gotten worse: it started off at this value, and now my value is worse. Now my derivative points to the left, it says I should decrease theta, but if my learning rate is too big, I may take a huge step going from here all the way out to there. So I end up over there, right? And if alpha is too big, I can take another huge step on the next iteration, and overshoot and overshoot and so on, until, as you notice, I'm actually getting further and further away from the minimum. So if alpha is too large, gradient descent can fail to converge, or even diverge.

Now, I have another question for you. This is a tricky one, and when I was first learning this stuff it actually took me a long time to figure out. What if your parameter theta 1 is already at a local minimum, what do you think one step of gradient descent will do?

So let's suppose you initialize theta 1 at a local minimum. So, suppose this is your initial value of theta 1 over here and is already at a local optimum or the local minimum.

It turns out that at a local optimum, your derivative will be equal to zero: the slope of the tangent line at that point will be equal to zero, and thus this derivative term is equal to zero. And so in your gradient descent update, theta one gets updated as theta one minus alpha times zero. And so what this means is that if you're already at a local optimum, it leaves theta 1 unchanged, because the update is theta 1 equals theta 1. So if your parameters are already at a local minimum, one step of gradient descent does absolutely nothing; it doesn't change your parameters, which is what you want, because it keeps your solution at the local optimum.

This also explains why gradient descent can converge to a local minimum even with the learning rate alpha fixed.

Here's what I mean by that; let's look at an example. So here's a cost function J of theta that I want to minimize, and let's say I initialize my gradient descent algorithm out there at that magenta point.

If I take one step in gradient descent, maybe it will take me to that point, because my derivative's pretty steep out there.

Now I'm at this green point, and if I take another step of gradient descent, you notice that my derivative, meaning the slope, is less steep at the green point than it was at the magenta point, because as I approach the minimum my derivative gets closer and closer to zero.

So after one step of gradient descent, my new derivative is a little bit smaller.

So as I take another step of gradient descent, I'll naturally take a somewhat smaller step from this green point than I did from the magenta point. Now I'm at a new point, the red point, and I'm even closer to the global minimum, so the derivative here is even smaller than it was at the green point. So I take another step of gradient descent.

Now my derivative term is even smaller, and so the magnitude of the update to theta 1 is even smaller, so I take a small step like so. And as gradient descent runs, you will automatically take smaller and smaller steps.

Until eventually you're taking very small steps, and you finally converge to the local minimum.

So just to recap: in gradient descent, as we approach a local minimum, gradient descent will automatically take smaller steps. That's because at a local minimum, by definition, the derivative is equal to zero, so as we approach the local minimum this derivative term automatically gets smaller, and gradient descent automatically takes smaller steps. This is why there's no need to decrease alpha over time.
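To make this concrete (again my own toy example, not from the lecture), the sketch below runs gradient descent on J(θ) = θ² with a fixed α and prints the magnitude of each update. The update magnitude α·dJ/dθ shrinks on its own as θ approaches the minimum, and at the minimum the update is exactly zero.

```python
def dJ(theta):
    """Derivative of the toy cost J(theta) = theta^2."""
    return 2 * theta

alpha, theta = 0.3, 1.0            # fixed learning rate, arbitrary starting point
for step in range(8):
    update = alpha * dJ(theta)     # alpha never changes; only the derivative shrinks
    theta = theta - update
    print(f"step {step}: |update| = {abs(update):.5f}, theta = {theta:.5f}")

# At the minimum itself the derivative is zero, so the update leaves theta unchanged.
theta_min = 0.0
print(theta_min - alpha * dJ(theta_min))   # prints 0.0
```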

So that's the gradient descent algorithm, and you can use it to try to minimize any cost function J, not just the cost function J that we defined for linear regression. In the next video, we're going to take the function J and set it back to be exactly linear regression's cost function, the squared error cost function that we came up with earlier. Putting gradient descent and this cost function together will give us our first learning algorithm: the linear regression algorithm.

Supplement Material - Gradient Descent Intuition

In this video we explored the scenario where we used one parameter θ1 and plotted its cost function to implement gradient descent. Our formula for a single parameter was:

Repeat until convergence:

$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$

Regardless of the slope's sign for $\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$, θ1 eventually converges to its minimum value. The following graph shows that when the slope is negative, the value of θ1 increases and when it is positive, the value of θ1 decreases.

On a side note, we should adjust our parameter α to ensure that the gradient descent algorithm converges in a reasonable time. Failure to converge, or taking too long to obtain the minimum value, implies that our step size is wrong.

How does gradient descent converge with a fixed step size α?

The intuition behind the convergence is that $\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$ approaches 0 as we approach the bottom of our convex function. At the minimum, the derivative will always be 0 and thus we get:

$\theta_1 := \theta_1 - \alpha \cdot 0$

Gradient Descent For Linear Regression

In this video we're gonna put together gradient descent with our cost function, and that will give us an algorithm for linear regression, for fitting a straight line to our data.

So this was what we worked out in the previous videos.

Here's the gradient descent algorithm, which should be familiar by now, and here's the linear regression model with our linear hypothesis and our squared error cost function.

What we're going to do is apply gradient descent to minimize our squared error cost function.

Now in order to apply gradient descent, in order to, you know, write this piece of code, the key term we need is this derivative term over here.

So we need to figure out what this partial derivative term is, and plugging in the definition of the cost function J, it turns out to be this:

$$\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1) = \frac{\partial}{\partial\theta_j}\,\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = \frac{\partial}{\partial\theta_j}\,\frac{1}{2m}\sum_{i=1}^{m}\left(\theta_0+\theta_1 x^{(i)} - y^{(i)}\right)^2$$

And all I did there was take the definition of my hypothesis and plug it in. It turns out we need to figure out what this partial derivative is for two cases: j equals 0 and j equals 1.

$j = 0 : \frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)$ $=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})$

$j = 1 : \frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)$ $=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})\cdot x^{(i)}$
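As an illustration (not part of the original notes), these two partial derivatives can be computed for the whole training set at once. The variable names X, y, theta0, and theta1 below are my own; X and y are assumed to be NumPy arrays holding the m training examples.

```python
import numpy as np

def cost_gradients(theta0, theta1, X, y):
    """Partial derivatives of the squared error cost J(theta0, theta1)
    for the hypothesis h(x) = theta0 + theta1 * x."""
    m = len(y)
    errors = theta0 + theta1 * X - y           # h_theta(x^(i)) - y^(i) for every i
    d_theta0 = np.sum(errors) / m              # the j = 0 case
    d_theta1 = np.sum(errors * X) / m          # the j = 1 case
    return d_theta0, d_theta1

# Tiny made-up example just to show the call.
X = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(cost_gradients(0.0, 0.0, X, y))          # both gradients are negative here
```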

Demonstrations:

So we want to figure out what this partial derivative is for both the theta 0 case and the theta 1 case, and I'm just going to write out the answers. It turns out the first term simplifies to 1/m times the sum over my training set of h(x(i)) minus y(i), and for the theta 1 case it turns out I get this term: h(x(i)) minus y(i), times x(i). Okay, and computing these partial derivatives, going from this equation to either of the equations down there, requires some multivariate calculus. If you know calculus, feel free to work through the derivations yourself and check that if you take the derivatives you actually get the answers that I got. But if you're less familiar with calculus, don't worry about it: it's fine to just take these equations that were worked out, and you won't need to know calculus or anything like that in order to do the homework. So let's implement gradient descent and get back to work.

Armed with these definitions, or armed with what we worked out to be the derivatives, which are really just the slope of the cost function J, we can now plug them back into our gradient descent algorithm. So here's gradient descent for linear regression, which is going to repeat until convergence: theta 0 and theta 1 each get updated as themselves minus alpha times the corresponding derivative term. The first term is of course just the partial derivative with respect to theta 0 that we worked out on the previous slide, and the second term is the partial derivative with respect to theta 1 that we worked out on the previous slide. And just as a quick reminder, when implementing gradient descent there's a detail you should be careful about: update theta 0 and theta 1 simultaneously.

So let's see how gradient descent works. One of the issues we saw with gradient descent is that it can be susceptible to local optima. When I first explained gradient descent, I showed you a picture of it going downhill on a surface, and we saw how, depending on where you initialize it, you can end up at different local optima: you will either wind up here or there. But it turns out that the cost function for linear regression is always going to be a bowl-shaped function like this. The technical term for this is a convex function (C-O-N-V-E-X). I'm not going to give the formal definition of a convex function, but informally a convex function is a bowl-shaped function, so this function doesn't have any local optima except for the one global optimum. And so gradient descent on this type of cost function, which you get whenever you're using linear regression, will always converge to the global optimum, because there is no other local optimum besides the global optimum.

So now let's see this algorithm in action. As usual, here are plots of the hypothesis function and of my cost function J. Let's say I've initialized my parameters at this value. Usually you would initialize your parameters at zero, theta 0 equals zero and theta 1 equals zero. But for this demonstration I've initialized theta 0 at about 900 and theta 1 at about -0.1, so this corresponds to h(x) = 900 - 0.1x, which is this line, and this point out here on the cost function.
Now, if we take one step of gradient descent, we end up going from this point out here, down and to the left, to that second point over there. And you notice that my line changed a little bit. As I take another step of gradient descent, the line on the left changes again, and I've also moved to a new point on my cost function. As I take further steps of gradient descent, I'm going down in cost, so my parameters are following this trajectory. And if you look on the left, this corresponds to hypotheses that seem to be getting to be better and better fits to the data, until eventually I've wound up at the global minimum, and this global minimum corresponds to this hypothesis, which gives me a good fit to the data.

And so that's gradient descent: we've just run it and gotten a good fit to my data set of housing prices, and you can now use it to predict. If your friend has a house of size 1250 square feet, you can now read off the value and tell them that maybe they could get $250,000 for their house.

Finally, just to give this another name, it turns out that the algorithm we just went over is sometimes called batch gradient descent. In machine learning, I feel like we machine learning people are not always great at giving names to algorithms. But the term batch gradient descent refers to the fact that in every step of gradient descent we're looking at all of the training examples. When computing the derivatives, we're computing these sums, so every step of gradient descent ends up computing something that sums over our m training examples. The term batch gradient descent refers to the fact that we're looking at the entire batch of training examples. It's really not a great name, but this is what machine learning people call it. It turns out that there are other versions of gradient descent that are not batch versions; instead of looking at the entire training set, they look at small subsets of the training set at a time. We'll talk about those versions later in this course as well. But for now, using the algorithm we just learned about, batch gradient descent, you now know how to implement gradient descent for linear regression.

So that's linear regression with gradient descent. If you've taken a class in advanced linear algebra, you might know that there exists a solution for numerically solving for the minimum of the cost function J without needing to use an iterative algorithm like gradient descent. Later in this course we'll talk about that method as well; it solves for the minimum of the cost function J without needing these multiple steps of gradient descent, and it's called the normal equations method. But it turns out that gradient descent will scale better to larger data sets than the normal equations method. And now that we know about gradient descent, we'll be able to use it in lots of different contexts and in lots of different machine learning problems as well.

So congrats on learning about your first machine learning algorithm. We'll later have exercises in which we'll ask you to implement gradient descent and hopefully see these algorithms work for yourselves. But before that, in the next set of videos, I first want to tell you about a generalization of the gradient descent algorithm that will make it much more powerful. I'll tell you about that in the next video.
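Putting everything together, here is a minimal sketch of batch gradient descent for univariate linear regression, with the simultaneous update of θ0 and θ1 mentioned above. The learning rate, iteration count, and the toy housing data are illustrative placeholders, not values from the lecture.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent on the
    squared error cost. Every iteration uses all m training examples."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0                  # a common choice: initialize at zero
    for _ in range(num_iters):
        errors = theta0 + theta1 * X - y       # h_theta(x^(i)) - y^(i) for every i
        grad0 = np.sum(errors) / m
        grad1 = np.sum(errors * X) / m
        # Simultaneous update: compute both gradients first, then update both thetas.
        theta0 = theta0 - alpha * grad0
        theta1 = theta1 - alpha * grad1
    return theta0, theta1

# Illustrative (made-up) data: house sizes in 1000s of sq. ft. and prices in $1000s.
X = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([150.0, 200.0, 250.0, 300.0, 350.0])
theta0, theta1 = batch_gradient_descent(X, y, alpha=0.1, num_iters=2000)
print(theta0, theta1)   # should approach the line price = 50 + 100 * size
```

The key design detail is the simultaneous update: both gradients are computed from the current values of theta 0 and theta 1 before either parameter is changed.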

